Visualization (Exploring Co-variation)
Table of contents
- Categorical variable and a continuous variable
- Two categorical variables
- Two continuous variables
- Graphics for production
Categorical variable and continuous variable
Roadmap
Explore penguin body mass (continuous) vs. species (categorical)
- boxplot by category
- densities by category

(no summary at the end of this section)
import altair as alt
from palmerpenguins import load_penguins
penguins = load_penguins()
display(penguins)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | male | 2009 |
| 340 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | female | 2009 |
| 341 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | male | 2009 |
| 342 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | male | 2009 |
| 343 | Chinstrap | Dream | 50.2 | 18.7 | 198.0 | 3775.0 | female | 2009 |
344 rows × 8 columns
continuous & categorical: box plot

continuous & categorical: mark_boxplot()
alt.Chart(penguins).mark_boxplot().encode(
    x=alt.X('species:N', title="Species"),
    y=alt.Y('body_mass_g:Q', title="Body Mass (g)"),
).properties(
    width=400,
    height=300
)
Discussion question: What do you notice from this graph?
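To make the discussion concrete, it can help to see the numbers a box plot summarizes. The sketch below computes quartiles per species in plain Python, using only the nine non-missing body_mass_g values visible in the table preview above (not the full dataset). The helper name five_number_summary is hypothetical, and the quartile interpolation here may differ slightly from the one the plotting library uses, so treat the output as illustrative.

```python
from statistics import quantiles

# body_mass_g values copied from the rows shown in the table preview above
# (row 3 is skipped because its value is NaN). This is NOT the full dataset.
body_mass_g = {
    "Adelie": [3750.0, 3800.0, 3250.0, 3450.0],
    "Chinstrap": [4000.0, 3400.0, 3775.0, 4100.0, 3775.0],
}


def five_number_summary(values):
    """Return min, Q1, median, Q3, max (hypothetical helper for illustration)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


summary = {species: five_number_summary(v) for species, v in body_mass_g.items()}

for species, stats in summary.items():
    print(species, stats)
```

A box plot draws the box from Q1 to Q3 with a line at the median. By default, mark_boxplot() extends the whiskers to the most extreme points within 1.5 times the interquartile range and draws anything beyond that as individual points.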
continuous & categorical: transform_density()
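As background for the density charts in this part: a kernel density estimate places a small bump (here, a Gaussian) on every observation and averages the bumps into a smooth curve. The sketch below shows that idea in plain Python; it is not Altair's implementation. The function name gaussian_kde_at, the bandwidth of 200 g, and the evaluation grid are assumptions chosen for illustration, and the data are only the Chinstrap values visible in the table preview.

```python
import math

# Chinstrap body_mass_g values from the table preview above (not the full dataset).
chinstrap_mass = [4000.0, 3400.0, 3775.0, 4100.0, 3775.0]


def gaussian_kde_at(x, data, bandwidth):
    """Average of Gaussian bumps centered on each observation, evaluated at x."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)


bandwidth = 200.0  # assumed smoothing width, in grams
step = 50  # assumed spacing of evaluation points, in grams
grid = list(range(2500, 5001, step))
density = [gaussian_kde_at(x, chinstrap_mass, bandwidth) for x in grid]

# A density curve should integrate to roughly 1 over a wide enough grid.
area = sum(d * step for d in density)
peak_x = grid[density.index(max(density))]
print(f"area under curve ~ {area:.3f}, peak near {peak_x} g")
```

In the Altair code, groupby=['species'] requests one such curve per species. transform_density() also accepts a bandwidth argument; when it is left unspecified, a value is estimated from the data.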
alt.Chart(penguins).transform_density(
    'body_mass_g',
    groupby=['species'],
    as_=['body_mass_g', 'density']
).mark_line().encode(
    alt.X('body_mass_g:Q'),
    alt.Y('density:Q', stack=None),
    alt.Color('species:N')
).properties(width=400, height=300)
continuous & categorical: transform_density()
Discussion question: What if we required the x-axis range to include zero? Would that improve or reduce clarity? Why?
alt.Chart(penguins).transform_density(
    'body_mass_g',
    groupby=['species'],
    as_=['body_mass_g', 'density']
).mark_line().encode(
    alt.X('body_mass_g:Q', scale=alt.Scale(zero=True)),
    alt.Y('density:Q', stack=None),
    alt.Color('species:N')
).properties(width=400, height=300)
continuous & categorical: transform_density() filled in
Filling the area under each curve with mark_area(opacity=0.3) does not change the content of the plot; it is arguably a bit more elegant.
alt.Chart(penguins).transform_density(
    'body_mass_g',
    groupby=['species'],  # Group by species for different density curves
    as_=['body_mass_g', 'density']
).mark_area(opacity=0.3).encode(
    alt.X('body_mass_g:Q'),
    alt.Y('density:Q', stack=None),
    alt.Color('species:N')
).properties(width=400, height=300)
Two categorical variables
Roadmap
- Exploring diamonds (color vs. price) with a table
- Plots: weighting by size vs. by color
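A table of counts is the usual starting point for two categorical variables. The diamonds data is not loaded in this part of the notes, so the sketch below instead cross-tabulates species and sex from the nine complete penguin rows visible in the earlier table preview; the variable names are illustrative. Each cell count is the quantity that a size or color encoding would then display.

```python
from collections import Counter

# (species, sex) pairs from the complete rows in the table preview
# (row 3 is skipped because its sex is NaN). This is NOT the full dataset.
rows = [
    ("Adelie", "male"), ("Adelie", "female"), ("Adelie", "female"),
    ("Adelie", "female"),
    ("Chinstrap", "male"), ("Chinstrap", "female"), ("Chinstrap", "male"),
    ("Chinstrap", "male"), ("Chinstrap", "female"),
]

counts = Counter(rows)  # one count per (species, sex) combination

species_levels = sorted({species for species, _ in rows})
sex_levels = sorted({sex for _, sex in rows})

# Print a small cross-tabulation: one row per species, one column per sex.
print("species".ljust(12) + "".join(level.rjust(8) for level in sex_levels))
for species in species_levels:
    cells = "".join(str(counts[(species, sex)]).rjust(8) for sex in sex_levels)
    print(species.ljust(12) + cells)
```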
Two continuous variables
Two continuous variables: roadmap
- movies: ratings from Rotten Tomatoes and IMDB
- diamonds: carat vs. price
movies dataset
import pandas as pd

movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
movies = pd.read_json(movies_url)
Covariation: a first binned scatter plot
alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
)
This plot suffers from overplotting and is also not very informative!
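One way to see what the first binned plot hides is to count how many points land in each bin. The sketch below bins made-up (Rotten Tomatoes, IMDB) rating pairs in plain Python. The ratings, the bin widths, and the helper name bin_floor are all assumptions for illustration; they are not values from the movies dataset and not Altair's binning algorithm.

```python
from collections import Counter

# Hypothetical (rotten_tomatoes, imdb) rating pairs, invented for illustration.
ratings = [
    (91, 7.8), (93, 8.1), (94, 7.6), (45, 5.9),
    (48, 5.6), (12, 3.9), (97, 7.9),
]


def bin_floor(value, width):
    """Lower edge of the fixed-width bin containing value (hypothetical helper)."""
    return (value // width) * width


rt_width, imdb_width = 10, 1  # assumed bin widths

# Count how many rating pairs fall into each (RT bin, IMDB bin) cell.
bin_counts = Counter(
    (bin_floor(rt, rt_width), bin_floor(imdb, imdb_width)) for rt, imdb in ratings
)

for (rt_bin, imdb_bin), n in sorted(bin_counts.items()):
    print(f"RT {rt_bin}-{rt_bin + rt_width}, IMDB {imdb_bin}-{imdb_bin + imdb_width}: {n}")
```

A plain binned scatter draws one identical circle for each occupied bin regardless of its count, so the counts need their own encoding channel to become visible.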
use alt.Size('count()') to address overplotting
xy_size = alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Size('count()')
)
xy_size
use alt.Color('count()') to address overplotting
xy_color = alt.Chart(movies_url).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Color('count()')
)
xy_color
Discussion question
xy_size | xy_color
Compare the size and color-based 2D histograms above. Which encoding do you think should be preferred? Why?
Rectangle plot + color
alt.Chart(diamonds).mark_rect().encode(
    alt.X('carat:Q', bin=alt.Bin(maxbins=70)),
    alt.Y('price:Q', bin=alt.Bin(maxbins=70)),
    alt.Color('count()', scale=alt.Scale(scheme='blues'))
)
Box plot
alt.Chart(diamonds).mark_boxplot().encode(
    alt.X('carat:Q', bin=alt.Bin(maxbins=10)),
    alt.Y('price:Q')
)
Summary: Exploring covariation
| Scenario | Functions |
|---|---|
| Categorical and continuous variable | mark_boxplot(), transform_density() |
| Two categorical variables | size, color |
| Two continuous variables | alt.Size('count()'), alt.Color('count()'), mark_boxplot(), binscatter |
Meta comment: iterating on plot design
“Make dozens of plots” – Quoctrung Bui, former 30535 guest lecturer and former Harris data viz instructor
What does he mean?
- The first plot you make will never be the one you should show
- As a rule of thumb, you should try out at least three different plotting concepts (marks)
- Within each concept, you will need to try out several different encodings